```mermaid
graph LR
A["Existing Chart Benchmarks<br/>(DVQA, FigureQA, ChartQA)<br/>Template-based, oversimplified"] --> B["Over-optimistic<br/>progress measures"]
B --> C["CharXiv<br/>2,323 real-world charts<br/>Expert-curated questions"]
C --> D["Realistic signal<br/>for chart understanding"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
CharXiv Reasoning
A comprehensive benchmark for chart understanding in multimodal LLMs: 2,323 real-world charts from scientific papers with expert-curated questions
Keywords: CharXiv, chart understanding, multimodal LLM, visual reasoning, scientific charts, MLLM evaluation, descriptive questions, reasoning questions, Princeton NLP, NeurIPS 2024

Introduction
Charts are everywhere — in scientific papers, financial reports, dashboards, and presentations. Understanding charts requires more than reading text; it demands visual perception, data extraction, and multi-step reasoning across complex visual elements.
Most existing chart benchmarks use oversimplified, template-generated charts with formulaic questions, leading to over-optimistic estimates of AI progress. Open-source models can appear to outperform strong proprietary models on these benchmarks, yet a simple stress test with slightly different charts or questions can degrade their performance by up to 34.5%.
CharXiv addresses this by providing a comprehensive evaluation suite of 2,323 natural, challenging, and diverse charts sourced directly from arXiv scientific papers. All charts and questions are handpicked, curated, and verified by human experts. The result is a far more realistic and faithful measure of chart understanding capabilities.
“All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs.” — CharXiv Paper
What Is CharXiv?
CharXiv (Chart + arXiv) is a comprehensive evaluation benchmark for chart understanding in Multimodal Large Language Models (MLLMs). It consists of 2,323 high-resolution charts manually sourced from arXiv preprints, each paired with expert-curated questions that test both basic comprehension and complex reasoning.
Two Types of Questions
CharXiv tests two fundamentally different capabilities:
- Descriptive Questions — Examine basic chart elements (axis labels, legends, data values, chart type identification). Each chart has 4 descriptive questions (3 answerable + 1 unanswerable designed to test whether models can recognize when information is not available).
- Reasoning Questions — Require synthesizing information across complex visual elements, performing multi-step reasoning, comparing trends, and drawing conclusions. Each chart has 1 reasoning question.
This gives a total of 11,615 questions across the full dataset (5 questions × 2,323 charts).
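The arithmetic behind that total is a quick sanity check:

```python
# Question counts as stated by the benchmark:
# 2,323 charts, each with 4 descriptive + 1 reasoning question.
NUM_CHARTS = 2323
DESCRIPTIVE_PER_CHART = 4  # 3 answerable + 1 unanswerable
REASONING_PER_CHART = 1

total = NUM_CHARTS * (DESCRIPTIVE_PER_CHART + REASONING_PER_CHART)
print(total)  # 11615 = 5,000 validation + 6,615 test questions
```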
Key Characteristics
| Feature | Details |
|---|---|
| Total charts | 2,323 (sourced from arXiv preprints) |
| Validation set | 1,000 charts / 5,000 questions (used for leaderboard) |
| Test set | 1,323 charts / 6,615 questions |
| Question types | Descriptive (4 per chart) + Reasoning (1 per chart) |
| Answer format | Open-vocabulary short answers (easily verifiable) |
| Chart diversity | Line, bar, scatter, heatmap, box plot, radar, and more |
| Source | Real scientific charts from arXiv papers |
| Curation | All handpicked and verified by human experts |
| Evaluation | Zero-shot, natural instructions, automated scoring |
| Venue | NeurIPS 2024 |
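Because answers are open-vocabulary short strings, scoring reduces to comparing a model's prediction against the gold answer. The official automated scorer lives in the GitHub repository; the sketch below is a deliberately simplified exact-match stand-in, and the `normalize` helper is illustrative, not CharXiv's actual routine:

```python
def normalize(ans: str) -> str:
    """Simplified normalization for comparing short answers.
    Illustrative only; CharXiv's official scorer is in its GitHub repo."""
    return ans.strip().lower().rstrip(".")

def exact_match(pred: str, gold: str) -> int:
    """Return 1 if the normalized prediction equals the gold answer."""
    return int(normalize(pred) == normalize(gold))

print(exact_match(" 0.85 ", "0.85"))        # 1
print(exact_match("Line chart", "bar chart"))  # 0
```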
```mermaid
graph TD
CX["CharXiv<br/>2,323 charts from arXiv"] --> D["Descriptive Questions<br/>(4 per chart)"]
CX --> R["Reasoning Questions<br/>(1 per chart)"]
D --> D1["Answerable (3)<br/>Axis labels, legends,<br/>data values"]
D --> D2["Unanswerable (1)<br/>Tests refusal ability"]
R --> R1["Multi-step reasoning<br/>Trend comparison,<br/>data synthesis"]
style CX fill:#e74c3c,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style R fill:#27ae60,color:#fff,stroke:#333
style D1 fill:#6cc3d5,color:#fff,stroke:#333
style D2 fill:#8e44ad,color:#fff,stroke:#333
style R1 fill:#56cc9d,color:#fff,stroke:#333
```
Who Built It?
CharXiv was developed by researchers at Princeton University’s Natural Language Processing Group (Princeton NLP), with contributions from the University of Wisconsin-Madison.
Publication
CharXiv was published at NeurIPS 2024, one of the premier machine learning conferences. The paper spans 121 pages with 90 figures, providing an exceptionally thorough analysis of chart understanding gaps across dozens of models.
| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2406.18521 |
| Project page | charxiv.github.io |
| GitHub | github.com/princeton-nlp/CharXiv |
| License | CC BY-SA 4.0 (questions); chart copyrights belong to original arXiv authors |
What Skills Does It Test?
CharXiv tests the complete pipeline of chart understanding — from basic visual perception to complex multi-step reasoning. Unlike benchmarks that test specialized knowledge, CharXiv reveals whether AI models can actually read and reason about charts the way humans do.
```mermaid
graph TD
CX["CharXiv<br/>Chart Understanding"] --> VP["Visual Perception<br/>Chart type, layout,<br/>color mapping"]
CX --> DE["Data Extraction<br/>Reading values,<br/>axis labels, legends"]
CX --> TR["Trend Recognition<br/>Patterns, comparisons,<br/>outliers"]
CX --> MR["Multi-step Reasoning<br/>Synthesizing across<br/>visual elements"]
CX --> RA["Refusal Ability<br/>Recognizing when<br/>info is unavailable"]
style CX fill:#e74c3c,color:#fff,stroke:#333
style VP fill:#3498db,color:#fff,stroke:#333
style DE fill:#27ae60,color:#fff,stroke:#333
style TR fill:#f39c12,color:#fff,stroke:#333
style MR fill:#8e44ad,color:#fff,stroke:#333
style RA fill:#e67e22,color:#fff,stroke:#333
```
| Capability | What CharXiv Tests | Question Type |
|---|---|---|
| Visual perception | Identifying chart types, layout elements, color codes | Descriptive |
| Data extraction | Reading specific values from axes, legends, and data points | Descriptive |
| Refusal ability | Recognizing when requested information is not in the chart | Descriptive (unanswerable) |
| Trend analysis | Comparing trends across multiple series or time periods | Reasoning |
| Multi-step reasoning | Combining multiple chart elements to draw conclusions | Reasoning |
| Cross-element synthesis | Integrating information from different parts of a complex chart | Reasoning |
Why Existing Benchmarks Fall Short
Existing chart benchmarks like DVQA, FigureQA, and ChartQA (including their subsets in MathVista) use template-generated charts with predictable structures. Models can exploit these patterns without truly understanding charts. CharXiv exposes this by using:
- Real scientific charts with diverse and complex layouts
- Expert-curated questions that cannot be answered by pattern matching
- Unanswerable questions that test whether models know their limits
Current Leaderboard
The leaderboard below shows model performance on the CharXiv validation set (1,000 charts, 5,000 questions), evaluated in a zero-shot setting with natural instructions. We highlight the Reasoning accuracy (the harder and more discriminating metric) alongside Descriptive accuracy.
Source: CharXiv Leaderboard (consulted March 28, 2026). All models evaluated zero-shot on the validation set.
Top Performers
| Rank | Model | Type | Size | Reasoning (%) | Descriptive (%) |
|---|---|---|---|---|---|
| — | Human | — | — | 80.50 | 92.10 |
| 1 | o3 (high) | Proprietary | — | 78.60 | 95.00 |
| 2 | o4 mini (high) | Proprietary | — | 72.00 | 94.30 |
| 3 | Claude 3.7 Sonnet | Proprietary | — | 64.20 | — |
| 4 | Claude 3.5 Sonnet | Proprietary | — | 60.20 | 84.30 |
| 5 | GPT 4.1 mini | Proprietary | — | 56.80 | 88.40 |
| 6 | GPT 4.1 | Proprietary | — | 56.70 | 87.90 |
| 7 | GPT 4.5 | Proprietary | — | 55.40 | 90.00 |
| 8 | o1 (high) | Proprietary | — | 55.10 | 88.90 |
| 9 | Doubao 1.5 Pro | Proprietary | — | 54.40 | 84.30 |
| 10 | o1 | Proprietary | — | 52.60 | 87.45 |
Top Open-Source Models
| Rank | Model | Size | Reasoning (%) | Descriptive (%) |
|---|---|---|---|---|
| 1 | Qwen2.5-VL 72B | 72B | 49.70 | 87.40 |
| 2 | InternVL3 38B | 38B | 46.40 | 87.20 |
| 3 | InternVL3 78B | 78B | 46.00 | 85.10 |
| 4 | InternVL3 14B | 14B | 43.10 | 82.20 |
| 5 | Qwen2-VL 72B | 72B | 43.00 | 81.35 |
| 6 | Qwen2.5-VL 7B | 7B | 42.50 | 73.90 |
| 7 | Pixtral 12B | 12B | 42.40 | 68.12 |
| 8 | InternVL V2.5 38B | 38B | 42.40 | 79.60 |
| 9 | InternVL V2.5 78B | 78B | 42.40 | 82.30 |
| 10 | GPT 4.1 nano | — | 40.50 | 73.90 |
Key Observations
```mermaid
graph LR
A["Descriptive Tasks<br/>Top models: 87–95%<br/>Close to human (92%)"] --> C["Chart basics<br/>are becoming<br/>tractable"]
B["Reasoning Tasks<br/>Top model: 78.6%<br/>Human: 80.5%"] --> D["Reasoning gap<br/>is closing but<br/>still significant"]
E["Open-source<br/>Best: 49.7%<br/>reasoning"] --> F["Large gap vs<br/>proprietary models<br/>(78.6%)"]
style A fill:#27ae60,color:#fff,stroke:#333
style B fill:#f39c12,color:#fff,stroke:#333
style C fill:#3498db,color:#fff,stroke:#333
style D fill:#e74c3c,color:#fff,stroke:#333
style E fill:#8e44ad,color:#fff,stroke:#333
style F fill:#e74c3c,color:#fff,stroke:#333
```
- Descriptive accuracy is becoming tractable — Top proprietary models score 87–95%, approaching human performance (92%)
- Reasoning remains the bottleneck — The best model (o3 high) scores 78.6%, close to human level (80.5%), but most models score well below 60%
- Large proprietary-to-open gap on reasoning — The best open-source model (Qwen2.5-VL 72B at 49.7%) lags significantly behind o3 (78.6%)
- Domain-specific models underperform — Specialized chart models (ChartLlama, ChartGemma, etc.) score below 15%, far worse than general-purpose MLLMs
- Model scale matters for open-source — 72B+ models consistently outperform smaller variants on reasoning
Where to Explore the Benchmark
Dashboards and Leaderboards
| Resource | Description | Link |
|---|---|---|
| CharXiv Leaderboard | Official leaderboard with reasoning and descriptive breakdowns | charxiv.github.io/#leaderboard |
| CharXiv Project Page | Full introduction, examples, and music video overview | charxiv.github.io |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| Hugging Face Dataset | Full 2,323-chart dataset with questions and annotations | huggingface.co/datasets/princeton-nlp/CharXiv |
| GitHub Repository | Evaluation code, model configs, and documentation | github.com/princeton-nlp/CharXiv |
| arXiv Paper | Full 121-page technical paper with analysis | arxiv.org/abs/2406.18521 |
| CSV Results | Downloadable validation results for all models | charxiv.github.io/data/val_result.csv |
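The downloaded results CSV can be inspected with the standard library alone. The column names used below ("Model", "Reasoning", "Descriptive") are assumptions for illustration; check the actual header of `val_result.csv` before relying on them:

```python
import csv
import io

# Inline stand-in for the downloaded val_result.csv; the column
# names here are assumptions, not the file's guaranteed header.
sample = io.StringIO(
    "Model,Reasoning,Descriptive\n"
    "o3 (high),78.6,95.0\n"
    "Qwen2.5-VL 72B,49.7,87.4\n"
)

rows = list(csv.DictReader(sample))
best = max(rows, key=lambda r: float(r["Reasoning"]))
print(best["Model"])  # o3 (high)
```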
Load the Dataset
```python
from datasets import load_dataset

# Load the full CharXiv dataset from Hugging Face
dataset = load_dataset("princeton-nlp/CharXiv")

# Access the validation set
val = dataset["validation"]
print(f"Validation charts: {len(val)}")
```
Reasoning Question Breakdown
CharXiv reports reasoning accuracy broken down by sub-categories, revealing where models struggle most:
| Sub-category | What It Tests |
|---|---|
| Information Retrieval | Extracting specific values from complex charts |
| Comparison | Comparing data points, trends, or categories |
| Pattern Recognition | Identifying visual patterns across data series |
| Counting | Enumerating elements in dense or complex charts |
| Inference | Drawing conclusions not explicitly shown in the chart |
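Given per-question results tagged with these sub-categories, a per-category accuracy breakdown is a small aggregation. The record fields below ("category", "correct") are hypothetical, chosen for illustration rather than taken from the dataset's actual schema:

```python
from collections import defaultdict

# Hypothetical per-question results; field names are illustrative.
results = [
    {"category": "Comparison", "correct": True},
    {"category": "Comparison", "correct": False},
    {"category": "Counting", "correct": True},
]

totals = defaultdict(int)  # questions per sub-category
hits = defaultdict(int)    # correct answers per sub-category
for r in results:
    totals[r["category"]] += 1
    hits[r["category"]] += int(r["correct"])

accuracy = {cat: hits[cat] / totals[cat] for cat in totals}
print(accuracy)  # {'Comparison': 0.5, 'Counting': 1.0}
```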
Why CharXiv Matters
```mermaid
graph LR
A["Template-based<br/>benchmarks"] --> B["Inflated scores<br/>on simple charts"]
B --> C["CharXiv exposes<br/>real gaps"]
C --> D["Better multimodal<br/>AI systems"]
A2["Reasoning gap<br/>overlooked"] --> B2["Models can describe<br/>but not reason"]
B2 --> C
C --> D2["Targeted research<br/>on chart reasoning"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
```
- Exposes inflated progress — Reveals that high scores on existing benchmarks don’t translate to real chart understanding
- Separates description from reasoning — Shows that models can extract data but struggle to reason about it
- Uses real-world charts — Scientific charts from arXiv are far more complex and diverse than template-generated ones
- Tests refusal ability — Unanswerable questions reveal whether models confabulate when information is missing
- Expert-curated quality — Every chart and question verified by human experts, ensuring meaningful evaluation
- Covers 97 models — The most comprehensive chart understanding leaderboard available
Conclusion
CharXiv reveals a critical truth about multimodal AI: being able to describe a chart is not the same as understanding it.
- 2,323 real scientific charts from arXiv with ~11,600 expert-curated questions
- Built by Princeton NLP (Zirui Wang, Danqi Chen, Sanjeev Arora, and team), published at NeurIPS 2024
- The best model (o3 high) scores 78.6% on reasoning — approaching but not yet matching human performance of 80.5%
- Most models score well below 60% on reasoning, despite achieving 85%+ on descriptive questions
- Open-source models lag significantly — the best (Qwen2.5-VL 72B at 49.7%) is nearly 30 points behind the best proprietary model on reasoning
- Domain-specific chart models underperform general-purpose MLLMs, suggesting that targeted chart training alone is insufficient
As multimodal AI advances, CharXiv provides a rigorous, realistic benchmark for measuring genuine chart understanding — not just pattern matching on simplified templates. The gap between descriptive and reasoning performance highlights the fundamental challenge ahead: teaching AI to truly reason about visual data.
References
- Wang, Z., Xia, M., He, L., Chen, H., Liu, Y., Zhu, R., Liang, K., Wu, X., Liu, H., Malladi, S., Chevalier, A., Arora, S., Chen, D. “CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs.” arXiv preprint arXiv:2406.18521 (2024). arxiv.org/abs/2406.18521
- Princeton NLP. “CharXiv — Project Page and Leaderboard.” charxiv.github.io (consulted March 28, 2026)
- Princeton NLP. “CharXiv Dataset.” Hugging Face. huggingface.co/datasets/princeton-nlp/CharXiv
- Princeton NLP. “CharXiv GitHub Repository.” github.com/princeton-nlp/CharXiv
Read More
- Compare with the hardest academic benchmark — see Humanity’s Last Exam (HLE)
- Compare with the AGI fluid intelligence benchmark — see ARC-AGI-2
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Scale your evaluation infrastructure — see Scaling LLM Serving for Enterprise Production
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- CharXiv Official Leaderboard
- CharXiv Dataset on Hugging Face
- CharXiv GitHub Repository
- CharXiv arXiv Paper (121 pages)